Fixes the issue described in #338 where the `__dp4a`-based CUDA kernels cause the model to produce garbage. The problem is that the cmake CUDA compute capabilities set in llama.cpp and ggml are different. In llama.cpp they are set to `52;61`: the lowest allowed compute capability, and the minimum compute capability for `__dp4a`. A GPU will automatically use the PTX code compiled for the highest compute capability it supports. If only 5.2 PTX code is generated (the default), the fallback implementation that returns 0 is used even on GPUs with compute capability >= 6.1, which causes the model to produce garbage outputs. Ideally I would have put the `__CUDA_ARCH__` check outside the kernels, but unfortunately this is not possible; `__CUDA_ARCH__` is only available in device code.
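
For illustration, here is a minimal sketch of the guard pattern described above. It is not the actual llama.cpp kernel; the kernel name, test values, and fallback body are made up, but it shows why a `__dp4a` call must be guarded by `__CUDA_ARCH__` inside device code and what happens when only the low-architecture branch gets compiled:

```cuda
#include <cuda_runtime.h>
#include <cstdint>
#include <cstdio>

// Hypothetical kernel: dot product of n ints, each packing four int8 values.
__global__ void dot_int8(const int *a, const int *b, int *out, int n) {
    int sum = 0;
    for (int i = threadIdx.x; i < n; i += blockDim.x) {
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 610
        // __dp4a: 4-way int8 dot product plus accumulate, CC >= 6.1 only.
        sum = __dp4a(a[i], b[i], sum);
#else
        // Fallback compiled into PTX for architectures below 6.1. If this
        // branch simply returned 0 and only 5.2 PTX were shipped, a CC 6.1+
        // GPU running that PTX would silently produce garbage -- the exact
        // failure mode this PR fixes.
        const int8_t *pa = (const int8_t *)&a[i];
        const int8_t *pb = (const int8_t *)&b[i];
        for (int k = 0; k < 4; ++k) {
            sum += (int)pa[k] * (int)pb[k];
        }
#endif
    }
    atomicAdd(out, sum);
}

int main() {
    const int n = 256;
    int *a, *b, *out;
    cudaMallocManaged(&a, n * sizeof(int));
    cudaMallocManaged(&b, n * sizeof(int));
    cudaMallocManaged(&out, sizeof(int));
    // Each int packs four int8 values of 1, so the expected result is 4 * n.
    for (int i = 0; i < n; ++i) { a[i] = 0x01010101; b[i] = 0x01010101; }
    *out = 0;
    dot_int8<<<1, 128>>>(a, b, out, n);
    cudaDeviceSynchronize();
    printf("dot = %d (expected %d)\n", *out, 4 * n);
    cudaFree(a); cudaFree(b); cudaFree(out);
    return 0;
}
```

Both branches have to exist in the compiled PTX for the right one to be picked at runtime, which is why the build must request both architectures, e.g. with cmake's `CMAKE_CUDA_ARCHITECTURES` set to `52;61` as described above.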